An Operation Placement and Scheduling Scheme for Cache and Communication Localities in Fine-Grain Parallel Architectures
Abstract
With increasing on-chip hardware resources, concurrency is a way to bridge the gap between the computational power demanded by applications and that afforded by computing platforms. Although parallel systems are increasingly popular, they remain very difficult to program; most compilers require the programmer to specify how to partition data or how to map program code onto the system's processors. For an efficient program, cache locality is important because of the large speed gap between microprocessors and memory systems. It is also important to use local communication whenever possible, since it is cheaper, faster, and less power-hungry than global communication. To exploit these locality properties, we present a systematic operation placement and scheduling scheme for fine-grain parallel architectures. Its key advantages are twofold: (1) the multiprojection method, which handles multidimensional parallelism systematically, relieves the programmer of much of the burden of coding and data partitioning; and (2) it addresses the memory/communication bandwidth bottleneck and can lead to faster program execution. On a design example of the motion estimation block-matching algorithm, which requires the most intensive computation and memory accesses in video coding, our method reduces external memory accesses by two to three orders of magnitude.
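To illustrate the memory-access pressure the abstract refers to, the following is a minimal sketch of full-search block matching using the sum of absolute differences (SAD). The block size, search range, frame dimensions, and function names here are illustrative assumptions, not the parameters or design of the paper; the point is only to show how a naive formulation re-reads the same reference pixels for every candidate displacement.

```c
/* Minimal full-search block-matching sketch (illustrative parameters only).
 * Every candidate displacement re-reads overlapping reference pixels,
 * which is why an unoptimized implementation is dominated by memory traffic. */
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

#define W      64      /* frame width  (assumed for the sketch) */
#define H      64      /* frame height (assumed for the sketch) */
#define BLK    8       /* block size BLK x BLK                  */
#define RANGE  4       /* search range +/- RANGE pixels          */

/* SAD between the current block at (bx,by) and the reference block
 * displaced by (dx,dy). */
static unsigned sad(const unsigned char *cur, const unsigned char *ref,
                    int bx, int by, int dx, int dy)
{
    unsigned s = 0;
    for (int y = 0; y < BLK; y++)
        for (int x = 0; x < BLK; x++) {
            int c = cur[(by + y) * W + (bx + x)];
            int r = ref[(by + y + dy) * W + (bx + x + dx)];
            s += (unsigned)abs(c - r);
        }
    return s;
}

/* Exhaustive search: try every displacement in the window and keep
 * the one with the smallest SAD. */
static void block_match(const unsigned char *cur, const unsigned char *ref,
                        int bx, int by, int *best_dx, int *best_dy)
{
    unsigned best = UINT_MAX;
    for (int dy = -RANGE; dy <= RANGE; dy++)
        for (int dx = -RANGE; dx <= RANGE; dx++) {
            /* Skip candidates that fall outside the reference frame. */
            if (bx + dx < 0 || by + dy < 0 ||
                bx + dx + BLK > W || by + dy + BLK > H)
                continue;
            unsigned s = sad(cur, ref, bx, by, dx, dy);
            if (s < best) { best = s; *best_dx = dx; *best_dy = dy; }
        }
}

int main(void)
{
    unsigned char cur[W * H], ref[W * H];
    for (int i = 0; i < W * H; i++) {          /* synthetic test frames */
        ref[i] = (unsigned char)(i & 0xFF);
        cur[i] = (unsigned char)((i + 3) & 0xFF);   /* roughly a 3-pixel shift */
    }
    for (int by = 0; by + BLK <= H; by += BLK)
        for (int bx = 0; bx + BLK <= W; bx += BLK) {
            int dx = 0, dy = 0;
            block_match(cur, ref, bx, by, &dx, &dy);
            printf("block (%2d,%2d): motion vector (%2d,%2d)\n", bx, by, dx, dy);
        }
    return 0;
}
```

Because overlapping search candidates reuse the same reference pixels, keeping those pixels in on-chip storage and scheduling operations so that the reuse happens locally is the kind of locality a placement and scheduling scheme can exploit; this is consistent with the two-to-three-order-of-magnitude reduction in external memory accesses reported in the abstract.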
Similar articles
Near fine grain parallel processing using a multiprocessor with MAPLE
The multi-grain parallelizing scheme is an effective parallelization approach that exploits parallelism at several levels of a sequential program: coarse grain (macro-dataflow), medium grain (loop-level parallelization), and near-fine grain (statement-level parallelization). The multiprocessor ASCA is designed for efficient execution of multi-grain parallelized programs. A processing element called MAPLE is mai...
Near Fine Grain Parallel Processing Using Static Scheduling on Single Chip Multiprocessors
With the increase in the number of transistors integrated on a chip, efficient use of those transistors and scalable improvement of a processor's effective performance are becoming important problems. However, it has been thought that the popular superscalar and VLIW approaches would have difficulty achieving scalable improvement of effective performance in the future because of the limitation of instruction-level para...
Threads: a System for the Support of Concurrent Programming. Technical Report
Many parallel applications are implemented using lightweight thread packages. The low overhead associated with user-level thread management encourages programmers to use threads to exploit fine-grain parallelism in an application. Although the overhead of explicit thread management can be very small, there is other overhead associated with lightweight threads: the time required to load data into ...
ParaWeaver: Performance Evaluation on Programming Models for Fine Grained Threads
There is a trend towards multicore or manycore processors in computer architecture design. In addition, several parallel programming models have been introduced. Some extract concurrent threads implicitly whenever possible, resulting in fine-grained threads. Others construct threads from explicit user specifications in the program, resulting in coarse-grained threads. How these two mechanisms imp...
Predictable Fine-Grained Cache Behavior for Enhanced Simultaneous Multithreading (SMT) Scheduling
By converting thread-level parallelism to instruction-level parallelism, Simultaneous Multithreaded (SMT) processors are emerging as an effective way to utilize the resources of modern superscalar architectures. However, the full potential of SMT has not yet been reached, as most modern operating systems use existing single-thread or multiprocessor algorithms to schedule threads, neglecting conten...